Data Analysis using Python
Given a DataFrame named df with columns City, Sales, and Quarter, write the pandas code to filter and display only the rows where the City is "New York" and Sales are greater than 50,000.
# Using boolean indexing with the & (AND) operator filtered_df = df[(df['City'] == 'New York') & (df['Sales'] > 50000)] print(filtered_df)
Create a DataFrame sales with columns: Product, Region, Sales, Profit. Add 5 rows of sample data. Display the first 3 rows.
import pandas as pd
data = {
'Product': ['A', 'B', 'C', 'A', 'B'],
'Region': ['North', 'South', 'North', 'East', 'South'],
'Sales': [100, 200, 150, 300, 250],
'Profit': [20, 40, 30, 60, 50]
}
sales = pd.DataFrame(data)
print(sales.head(3))
From the sales DataFrame, filter rows where Sales > 150 and Region == 'South'.
filtered = sales[(sales['Sales'] > 150) & (sales['Region'] == 'South')]
print(filtered)
Find the mean, sum, and standard deviation of Sales column.
mean_sales = sales['Sales'].mean()
sum_sales = sales['Sales'].sum()
std_sales = sales['Sales'].std()
print("Mean Sales:", mean_sales)
print("Sum Sales:", sum_sales)
print("Std Dev Sales:", std_sales)
For the Sales column, calculate a 2-period moving average.
sales['Sales_MA2'] = sales['Sales'].rolling(window=2).mean()
print(sales)
You have a DataFrame df tracking daily stock prices in a column named Close_Price.
# 1. Finding the median median_price = df['Close_Price'].median() # 2. Calculating the 7-day moving average df['7_Day_MA'] = df['Close_Price'].rolling(window=7).mean()
Using Sales and Profit columns, apply KMeans with 2 clusters.
from sklearn.cluster import KMeans
X = sales[['Sales', 'Profit']]
kmeans = KMeans(n_clusters=2, random_state=0)
sales['Cluster'] = kmeans.fit_predict(X)
print(sales)
You want to group your customers into 4 distinct segments using K-Means based on their Age and Spending_Score. Assuming your data is already scaled and stored in a variable X, write the scikit-learn code to initialize, fit, and predict these clusters.
from sklearn.cluster import KMeans # Initialize the model with 4 clusters kmeans = KMeans(n_clusters=4, random_state=42) # Fit the model and predict the cluster labels cluster_labels = kmeans.fit_predict(X) # (Optional) Save back to your original dataframe # df['Cluster'] = cluster_labelsYou are building a Decision Tree to predict whether a customer will "Churn" (Yes/No). Your features are in X and target is in y.
from sklearn.tree import DecisionTreeClassifier # 1. Initialize with max_depth limit to prevent overfitting dt_classifier = DecisionTreeClassifier(max_depth=5, random_state=42) # 2. Fit the model dt_classifier.fit(X, y)
In Market Basket Analysis, if a store has 1,000 total transactions, and Bread is bought in 200 of them, while Milk and Bread are bought together in 50 transactions:
You have a raw dataset of employee data. You want to see the total (sum) salary paid out, broken down by Department (as rows) and Gender (as columns). How would you set this up?
pivot_df = df.pivot_table(values='Salary', index='Department', columns='Gender', aggfunc='sum')
You have two sheets/DataFrames:
# Merging the two dataframes on the common column 'Customer_ID' updated_orders = orders.merge(customers[['Customer_ID', 'Customer_Name']], on='Customer_ID', how='left')
Using Sales and Profit as features, predict Region (convert to numeric first).
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
sales['Region_Code'] = le.fit_transform(sales['Region'])
X = sales[['Sales', 'Profit']]
y = sales['Region_Code']
model = DecisionTreeClassifier()
model.fit(X, y)
# Predict for a new sale (Sales=200, Profit=45)
print(model.predict([[200, 45]])) # Output: encoded region
Given a transaction dataset, find frequent itemsets with min support = 0.5.
from mlxtend.frequent_patterns import apriori, association_rules
# Sample transaction data in one-hot format
data = {'Milk': [1, 0, 1, 1],
'Bread': [1, 1, 0, 1],
'Butter': [0, 1, 1, 0]}
df = pd.DataFrame(data)
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="lift", min_threshold=1)
print(frequent)
print(rules)
Create a pivot table showing total Sales per Region and Product.
pivot = pd.pivot_table(sales, values='Sales', index='Region', columns='Product', aggfunc='sum')
print(pivot)
You have two DataFrames: sales (Product, Sales) and prices (Product, Price). Add Price to sales DataFrame using Product as key.
prices = pd.DataFrame({
'Product': ['A', 'B', 'C'],
'Price': [10, 15, 12]
})
sales_with_price = pd.merge(sales, prices, on='Product', how='left')
print(sales_with_price)